Creating data visualisation beyond default
In this take-home exercise, I will be exploring and revealing the demographic of the city of Engagement, Ohio USA by using appropriate static statistical graphics methods. The data will be processed by using appropriate tidyverse family of packages and the statistical graphics will be done with ggplot2 and its extensions. Datasets used will be taken from the VAST Challenge 2022.
Before I get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, I will install the R packages and load them onto R environment.
The chunk code below will do the trick.
packages = c('tidyverse','psych','plotly')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The code chunk below imports Participants.csv from the data
folder, into R by using read_csv()
of readr and
save it as an tibble dataframe called participants_data.
participants_data <- read_csv("data/Participants.csv")
After importing the Participants.csv, I used the function glimpse()
of dplyr,
like its name suggests, to get a glimpse of the data that I am are
working on.
glimpse(participants_data)
Rows: 1,011
Columns: 7
$ participantId <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~
From the output, we know that the dataset consists of 1011 unique participants and has 7 different columns of metadata related to the participants.
It seems like the data in column ‘participantId’ are just unique tags given to each participant in this study, hence is likely not useful in this preliminary analysis.
The data in columns ‘age’ and ‘joviality’ appears to be continuous data type, while data in column ‘householdSize’ is likely to be a discrete data type.
The data in columns ‘haveKids’, ‘educationLevel’
and ‘interestGroup’ appears to be categorical data type. With
the above initial observations, I would be able to see what are the
unique values and count in these categorical and discrete data type
columns.
table(participants_data$householdSize)
1 2 3
337 373 301
For ‘householdSize’, there are 3 unique values of 1, 2 and
3. Also, the count of these unique values appears to be quite even split
out.
table(participants_data$haveKids)
FALSE TRUE
710 301
For ‘haveKids’, it consists of either True or False
(boolean), with around 70% of the participants not having kids.
table(participants_data$educationLevel)
Bachelors Graduate HighSchoolOrCollege
232 170 525
Low
84
For ‘educationLevel’, there are 4 unique types, namely
Bachelors, Graduate, High School/College and Low. A quick glance tells
us that bulk of the participants (~50%) are only having high
school/college qualifications. Lowly educated participants are minority
in this case.
table(participants_data$interestGroup)
A B C D E F G H I J
102 91 102 96 83 106 108 111 96 116
For ‘interestGroup’, there are 10 unique types, named from A
to J (actual interest group names are redacted). A high level glance at
the numbers tells us that the number of participants in the various
interest groups are quite evenly split too.
For curiosity, the describe()
of psych
package, was used to get a brief statisitical understanding of the
dataset too.
des <- describe(participants_data, fast = TRUE)
print(des, digits=5)
vars n mean sd min max
participantId 1 1011 505.00000 291.99486 0.0000 1010.00000
householdSize 2 1011 1.96439 0.79399 1.0000 3.00000
haveKids 3 1011 NaN NA Inf -Inf
age 4 1011 39.07418 12.37930 18.0000 60.00000
educationLevel 5 1011 NaN NA Inf -Inf
interestGroup 6 1011 NaN NA Inf -Inf
joviality 7 1011 0.49379 0.29135 0.0002 0.99923
range se
participantId 1010.00000 9.18332
householdSize 2.00000 0.02497
haveKids -Inf NA
age 42.00000 0.38933
educationLevel -Inf NA
interestGroup -Inf NA
joviality 0.99903 0.00916
From the output, the observations are,
The code chunk below plots a histogram by using geom_histogram()
of ggplot2.
ggplot(data = participants_data,
aes(x=age)) +
geom_histogram(bins=20, fill = 'light blue', color='black') +
ggtitle("Histogram of Participants' age") +
xlab('Age') +
ylab('Count') +
scale_x_continuous(breaks = seq(10, 70, by = 2)) +
scale_y_continuous(breaks = seq(0, 100, by = 5))
From the above chart, we are able to see the distribution of the participants across the various age bins.
From this simple histogram, we can see that participants of age 18-19 are the minority, while participants of age 30-31 have the largest proportion within this dataset population. Other notable age groups with higher proportion are of age 42-43 and 52-53.
Understanding that ggplot also allows us to fill in the chart with another additional parameter, I’ve added the ‘fill’ parameter into the code chunk that would embed the ‘educationLevel’ data into the chart too, as seen below.
ggplot(data = participants_data,
aes(x=age, fill= educationLevel)) +
geom_histogram(bins=20, color='black') +
ggtitle("Histogram of Participants' age, filled by Education Level") +
xlab('Age') +
ylab('Count') +
labs(fill="Education Level") +
scale_x_continuous(breaks = seq(10, 70, by = 2)) +
scale_y_continuous(breaks = seq(0, 100, by = 5))
Now we can see the distribution of the various academic qualifications of the participants across the age histogram. Visually, it is obvious that within the age groups with larger proportion (eg. age 30-31 and 52-53) also have a larger proportion of them with high school/college qualifications.
With respect to the larger age 30-31 group, assuming that the dataset is dated for 2022, it would mean that these participants were born in around the year 1991-1992. Coinciding with this time period was the early 1990s economic recession era.
Incidentally, if we look at the peak groups (age 41-42, 52-53), the year of birth from these participants are 1980-1981 and 1969-1970, which also coincide with the economic recession period back then.
Assuming my assumption is correct, throughout, we can also see that after these spikes in births, the subsequent years saw a relatively sharp decline in numbers before creeping up gradually.
Usually, birth rates declined after economic crisis happens. However, more cross-referencing data is required to reveal more and confirm on interesting observations.
A additional tweak applied was to use plotly package to create interactive charts, as seen below.
ggplotly(ggplot(data = participants_data,
aes(x=age, fill= educationLevel)) +
geom_histogram(bins=20, color='black') +
ggtitle("Histogram of Participants' age, filled by Education Level") +
xlab('Age') +
ylab('Count') +
labs(fill="Education Level") +
scale_x_continuous(breaks = seq(10, 70, by = 2)) +
scale_y_continuous(breaks = seq(0, 100, by = 10)))
With this, we are now able to recieve more microdata (eg. count and average) regarding the different sub-groups in the chart as we hover our cursor over them.
While the previous chart combines everything into 1 chart, supposedly for a 1-stop chart solution, the distribution and count of the participants may not be that obvious.
Hence the next chart includes facet_wrap()
into the code chunk, which splits out the output into separate
mini-charts grouped by the Education Level data.
ggplotly(ggplot(data = participants_data,
aes(x=age, fill= educationLevel)) +
geom_histogram(bins=20, color='black') +
ggtitle("Histogram of Participants' age, filled by Education Level") +
facet_wrap(~educationLevel) +
xlab('Age') +
ylab('Count') +
labs(fill="Education Level") +
scale_x_continuous(breaks = seq(10, 70, by = 5)) +
scale_y_continuous(breaks = seq(0, 100, by = 10)))
Now, it will be clearer to see the distribution of the education qualification of the participants across various ages.
Now that we have established the rough idea of how I am going to plot the charts, let’s move on to explore the demographics with a slight change in the analysed parameter.
Here we will be replacing the ‘age’ column data in x, with ‘joviality’ column data.
ggplotly(ggplot(data = participants_data,
aes(x=joviality, fill= educationLevel)) +
geom_histogram(bins=10, color='black') +
ggtitle("Histogram of Participants' joviality, filled by Education Level") +
xlab('Joviality') +
ylab('Count') +
labs(fill="Education Level") +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_y_continuous(breaks = seq(0, 140, by = 10)))
Based on the chart, it appears that there is a good spread of participants (with varying academic qualifications) across, except for the two extreme ends (left being the very unhappy group, right being the seriously happy group). Proportion wise, it seems like education status does not have much impact on the happiness level.
The same chart is plotted with plotly package.
ggplotly(ggplot(data = participants_data,
aes(x=joviality, fill= educationLevel)) +
geom_histogram(bins=20, color='black') +
ggtitle("Histogram of Participants' joviality, filled by Education Level") +
facet_wrap(~educationLevel) +
xlab('Joviality') +
ylab('Count') +
labs(fill="Education Level") +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_y_continuous(breaks = seq(0, 140, by = 10)))
Next up, a chart was plotted to see the distribution of participants of various ages with their respective proportion of whether they have kids.
ggplotly(ggplot(data = participants_data,
aes(x=age, fill= haveKids)) +
geom_histogram(bins=20, color='black')+
ggtitle("Histogram of Participants' age, filled by whether they have kids") +
xlab('Age') +
ylab('Count') +
labs(fill="Have Kids") +
scale_x_continuous(breaks = seq(10, 70, by = 2)) +
scale_y_continuous(breaks = seq(0, 140, by = 10)))
Coincidentally, the larger groups for age 30-31, 41-42 and 52-53, seems to have a larger proportion of them not having kids. If we were to make a bold assumption that birth bearing age at ~25 to 30 years old, it seems like these group of participants were also experiencing a economic recession during their prime child bearing age.
Example, for the age group of 52-53, it was the year of 2000 when another economic recession happened when these participants were around 30 years old.
A scatter plot was charted with participants’ age and joviality. It seems like there is no correlation at all between the participants’ age and their happiness level.
ggplot(data = participants_data,
aes(x=age, y= joviality)) +
geom_point() +
geom_smooth(size=0.5) +
ggtitle("Scatter Plot of Participants' Joviality vs Age") +
xlab('Age') +
ylab('Joviality') +
scale_x_continuous(breaks = seq(10, 70, by = 2)) +
scale_y_continuous(breaks = seq(0, 1, by = 0.1))
Next up, a boxplot was plotted with geom_boxplot()
and geom_point(),
with facet_wrap()
as below.
ggplot(data = participants_data,
aes(y=age, x = interestGroup)) +
geom_boxplot() +
geom_point(stat='summary',
fun.y='mean',
colour='red',
size=2) +
facet_wrap(educationLevel~.) +
ggtitle("Box Plot of Participants' Age across various Interest Group, grouped by Education Level") +
xlab('Interest Group') +
ylab('Age')
Some key observations are,
While the actual interest group type/names are redacted, this insight does tell us how different age group of various education status has a impact on the kind of interest groups they are likely to be in.
Following, we will look at the same boxplots but with respect to Joviality instead of Age.
ggplot(data = participants_data,
aes(y=joviality, x = interestGroup)) +
geom_boxplot() +
geom_point(stat='summary',
fun.y='mean',
colour='red',
size=2) +
facet_wrap(educationLevel~.) +
ggtitle("Box Plot of Participants' Joviality across Interest Group, grouped by Education") +
xlab('Interest Group') +
ylab('Joviality')
Some key observations are,
In our next boxplot, I’ve added another dimension (whether the particpants have kids or not) to see if there is any more interesting observations.
ggplot(data = participants_data,
aes(y=age, x = haveKids)) +
geom_boxplot() +
ggtitle("Box Plot of Participants' age across various Interest Group, grouped by Education") +
xlab('Have Kids?') +
ylab('Age') +
facet_grid(interestGroup~educationLevel) +
coord_flip()
Some key observations are,
While it may not be intuitive now, but I believe with deeper analysis, we may be able to connect the dots and uncover some insights.
With that, this is the end of my take-home exercise 1. =)